Tiago B.N. Silveira, MSc.
July 15, 2018
This project is part of the Udacity Data Analysis Nanodegree and aims to apply exploratory data analysis (EDA) techniques using R in order to find out relationships in one to multiple variables, as well as exploring a given dataset for distributions, outliers, and anomalies.
EDA can thus be understood as the numerical and graphical examination of data characteristics and relationships before formal, rigorous statistical analyses are applied. EDA can lead to insights, which may uncover to other questions, and eventually predictive models. As explained during the Udacity course, EDA is also an important “line of defense” against bad data and is an opportunity to notice that some assumptions or intuitions about a dataset are violated.
The first step before diving down EDA is to provide a short introduction into the dataset we are planning to explore. In this case, from a list of available options I have chosen the “Wine Quality” dataset (available at http://dx.doi.org/10.1016/j.dss.2009.05.016) since I’m a wine lover.
Said that, and since our data is already loaded, the first thing to do is to take a look at the very first line of the dataset:
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.4 0.70 0.00 1.9 0.076
## 2 2 7.8 0.88 0.00 2.6 0.098
## 3 3 7.8 0.76 0.04 2.3 0.092
## 4 4 11.2 0.28 0.56 1.9 0.075
## 5 5 7.4 0.70 0.00 1.9 0.076
## 6 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality
## 1 5
## 2 5
## 3 5
## 4 6
## 5 5
## 6 5
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
We already knew it is a tidy dataset. Anyway, it was possible to check there is any missing value (NA values, for example) through the summary output.
To understand each data attributes, it’s helpful to consult the data description where the following information is found:
The two datasets are related to red and white variants of the Portuguese “Vinho Verde” wine. For more details, consult: http://www.vinhoverde.pt/en/ or the reference [1]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc).
It’s also important to notice that besides the target feature is the wine quality, the dataset is unbalanced regarding to it. It means there are few excellent or poor wines. This issue will be addressed later. For now, let’s start taking a look at each attribute in order to understand our data in deepen way.
In order to catch the meaning of each variable in the dataset, in this section we will seek to explore each attribute in what is known as univariate analysis. The most relevant attributes of this dataset are consolidated at the end of this section.
Since it’s the target attribute, there is any doubt about its relevance. That way, let us first understand its distribution.
The first thing to notice is that there are few really bad wines, i.e. those whose quality is below 3. In the same way, there are few really good wines, ie, those whose quality is above 8. It’s also effort to notice this distribution seems like a normal one.
Considering this is the target attribute, all the upcoming univariate analysis will be in relation to this attribute. Due to it I will first write a simple function to plot the histogram for a given variable, together with a boxplot showing out the outliers.
This attribute refers to the percent alcohol content of the wine (% of volume).
We can notice from the boxplot there are few outliers in this attribute distribution, i.e. most of the samples stands inside \(1.5\) times the interquartile rate (IQR). From the histogram, it seems to have a clear distinction between the alcohol distributions for each quality level. In order to see it clearly, let’s plot those histograms separately:
Accordingly to the dataset documentation, it refers to a wine additive which can contribute to sulfur dioxide gas (S02) levels. This last acts as an antimicrobial and antioxidant. The sulphates levels are measured by the concentration of potassium sulphate in g/dm³.
To find out the range of sulphates in the evaluated wines, we can just summarize this attribute, as follow:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0,3300 0,5500 0,6200 0,6581 0,7300 2,0000
As we did before, the next graphs show how this attribute is distributed along our data and how it’s related to our target variable:
The values of this attribute are distributed around the mean and there is any clear correlation to the wine quality, at least at this first glance. Anyway, the information provided by the first chart was already obtained from summary, although from the histogram it becomes clear the almost normal distribution with right-skewed values. The outliers are noticeable from both charts too.
This attribute describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale.
Again, our first step will be checking the values for this attribute using the summary command, whose values around 3.31 confirm the above statement.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2,740 3,210 3,310 3,311 3,400 4,010
Trying to see this same information through the histogram, in order to confirm visually if it follows a normal distribution:
As shows above in the plots, the wine pH follows a quasi-normal distribution. We can also confirm that it varies in the range of 3.0 to 3.6, also with some outliers beyond this range.
In my opinion, the wines that are best evaluated have a higher pH. To validate my supposition I will plot again the pH distribution according to the quality assessment by winemakers:
From the last chart I see I was wrong. Besides my personal taste goes for wines not so acidic, the pH can’t be directly related to the quality.
The wine densities vary accordingly to density of the water and on the percent alcohol and sugar content. I particularly prefer full-bodied wines to those lighter ones, characteristics I immediately associate to density. But as far as I know, there are both strong and soft great wines, in the way I don’t think density would be a decision attribute for wine quality. Let’s check?
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0,9901 0,9956 0,9968 0,9967 0,9978 1,0037
From the summary information, the variance is about 0.01 or 1%. It means a variation of just 1g for each 100ml of wine. I wonder if humans are able to perceive such small variance, which has lead me to an interesting reference [2] where it is said that the term “body” or “weight” of a wine, besides is almost always a desirable aspect, is a wine property not fully defined. In the same reference, what called my attention is that sweetness and the presence of glycerol can increase the perception of body, while acidity appears to reduce the perception of body. This description has clarified my confusion around wine density, in the way we now follow with the analysis:
Just for curiosity, I’d like to see where the mean density for each quality category stands on the chart above. In order to do this, I should first group the data I’m interested in (using the mean):
## quality density
## 1 3 0,9974640
## 2 4 0,9965425
## 3 5 0,9971036
## 4 6 0,9966151
## 5 7 0,9961043
## 6 8 0,9952122
After calculating the density mean for each quality level, I’d like to see where they fall in the density histogram, as follow:
Again, it’s interesting to perceive how my initial idea was wrong: the wines assigned with higher quality have lower density in average. Let’s move forward with a similar analysis for the other attributes.
As explained in the dataset description, the sulfur dioxide \(SO_2\) can be found in its free-form, as a dissolved gas or as a bisulfite ion, or in bounded forms with other molecules. Besides its function to prevent microbial growth and wine oxidation, it is mostly undetectable in wine. Free concentrations over 50 ppm, though, turns it evident in the nose and taste of wine.
## free.sulfur.dioxide total.sulfur.dioxide
## Min. : 1,00 Min. : 6,00
## 1st Qu.: 7,00 1st Qu.: 22,00
## Median :14,00 Median : 38,00
## Mean :15,87 Mean : 46,47
## 3rd Qu.:21,00 3rd Qu.: 62,00
## Max. :72,00 Max. :289,00
The summary above shows out there must be an outlier on data. Also, considering the previous statements, I believe the quality of wine is inversely proportional to the presence of sulfur dioxide, which I will seek to validate through the charts:
In the chart above I intended to plot the histogram both for the free form and the total amount of sulfur dioxide. The column on the left the histograms are shown in a linear x-axis, from which is clear to observe that the total amount follows the same shape as for the free form, with a higher number as expected. This graphic layout, though, does not allow to compare how the sulfur dioxide is present considering the wine quality. For this reason, the column on the right shows the same data now in a logarithmic x-axis, which in turns shows this attribute is spread out in the same way for all quality assignments.
Since there are some outliers, as noticed in the beggining when summarizing the data, I will make use of a bloxplot to check how distant it is (or they are) from the mean, in the same way it was done for the previous attributes:
From the boxplot it’s clear to see there are two outliers whose amount of sulfur dioxide goes above \(200 {mg}/{dm}^3\). I will back to this point later during the multivariate analysis.
This attribute stands for the amount of salt (sodium chloride) in the wine, measured in \(g/{dm}^3\), i.e. in grams per liters.
From the boxplot we notice there are outliers in the right-skewed chlorides distribution. From the histogram, it’s interesting to observe that the lowest and highest quality levels are around the same salt values.
This attribute stand for the amount of sugar remaining after fermentation stops. Accordingly to the dataset documentation, it’s rare to find wines with less than 1 gram/liter (or \(g/{dm}^3\) in the International System of Units - SI). On the other hand, wines with more than 45 grams/liter of sugar are categorized as sweet.
Analyzing the histogram above, even though there are some outliers, none of them is above \(45 g/{dm}^3\), which means that the whole dataset is about dry wines.
There are three attributes in the dataset related to the wine acidity, all of them measured in \(g/{dm}^3\):
* Fixed acidity: normally refers to tartaric acid which are involved with wine and do not evaporate readily.
* Volatile acidity: the amount of acetic acid in wine, which at too high levels can be unpleasant, tasting like a vinegar.
* Citric acid: usually found in small quantities, citric acid can add some freshness and flavor to wines.
Considering the description above, it becomes clear the difference among the acidities type and their impact on the wine quality. For this reason, the next charts plot not only a histogram for each acidity type, but also the minimum, maximum and mean values for each one considering only the wines assigned with a quality level higher than 7.
The histograms above seemed reasonable to me. What called my attention, though, are the maximum values for the wines assigned with high-quality level. Especially when considering the volatile acidity, which from description is the attribute responsible for an unpleasant tasting, I expected the maximum value to be lower. In [2] there are many discussions regarding acidity in wine, but the typification differs from the presented in the data description. This is an aspect of the data that should be investigate in more detail.
In this Section I explored the Red Wine dataset built from a scientific research about whose objective was to model wine preferences by data mining from physicochemical properties [1].
This first analysis, called univariate because it considers only one attribute at a time, was essential since it enabled an understanding of the subject to be explored, the available data, and the desired information (the relationship between the measured attributes and the best quality assignments). Finally, knowing the meaning of each attribute and its relation to the evaluated “phenomenon” is essential for obtaining insights and a consequent successful analysis.
At this point, we are already able to select some important attributes in order to check how they relate to each other:
The quality attribute will be our targeted variable throughout the whole analysis. As a premise, this analysis objective is to find out which attributes are related to wines of better quality, ie, those assigned with a quality level \(\geq 7\).
Considering the premise above and the findings of the univariate analysis, alcohol becomes an important feature, since it clearly separates high-quality wines from the low-quality ones.
The pH and density attributes become relevant for a further analysis since they are related to the body sensation, as explored above.
Some other attributes, such as sulfur dioxide and chlorides, seem to be interesting in some aspect and will be explored in the multivariate analysis.
The acidity is very interesting due to its perceived sensations. Besides there is no reference values for them, the outliers should be explored.
In this section we will explore the relationships among all variables in the dataset, directing the analysis to those that have seemed to be more relevant from the univariate analysis. We start this by calculating the correlation among all the attributes:
From the correlation matrix we notice the only relevant attribute to quality is the percentage of alcohol. There are some relevant correlation among other variables as well, but they don’t contribute to the target variable that we are interested in. Anyway, let’s keep looking deeper to this data plotting some scatterplot pairs considering only the attributes we found relevant in the univariate section:
The pair plots above summarize all the information we have gotten until now:
- The unbalanced distribution of quality, ie, there are more wines classified as 5 or 6 than those below and above these values.
- Quality is negatively correlated to the volatile acidity.
- Quality is positively correlated to the alcohol percentage.
- Other attributes that present some correlations are pH, density and chlorides, although not strong ones.
In our analysis this is the relationship that seems to be the most relevant so far. My first idea was to plot a scatter plot with a smooth line to analyze it deeply. However, this type of visualization is not appropriate when one of the features on the chart is categorical or discrete, as is the case of quality. A boxplot, though, brings so much more information in this case:
As it was expected to, there is some correlation between the alcohol percentage in the wine and its quality. However, as it’s clear to see from the scattered points, different quality levels are assigned when the wine has 11% of alcohol. This leads us to afirm that relevant correlation between alcohol and quality arises only when we are dealing with the bordering values.
Another way to check it out is building a radar chart with the alcohol statistics for each one of the quality levels:
From the chart above it becomes really clear that the wines assigned with a better quality in average have higher alcohol percentages.
From the scatterplots above we can see a weak negative correlation between density and pH, ie, the higher the pH the lower the density. Beyond this there is no other relevant information from these graphs, since the correlation is somehow equivalent for all levels of quality.
A future research may include the pH and density corresponding to the main hydrographic basins of the regions where the wines are produced, also associating these attributes with the “body” perception.
To analyse these attributes I choosed the total sulfur dioxide, since we have found before the free sulfur dioxide distribution has the same shape. Taking a look at the figure, specifically in chart 7, we can notice an outlier measure of sulfur dioxide around \(300 mg/{dm}~3\). In order to avoid it impacting the charts I chose to apply the logarithm, although I could also just adjust the limits of the x-axis.
Regarding the interpretation of the graph, it becomes evident that there is no correlation between these two variables, which was also shown on our first correlation matrix.
Most of the information we could gather from the dataset variables were already obtained in the previous section. From the previous analysis, alcohol is the attribute in our data most correlated to our target variable. In a lower rate, the volatile.acidity is also correlated. The next charts will try to explore the relationships among these three variables:
Previously we have found that the alcohol percentage correlation to quality levels is higher for bordering values. From the chart above we perceive it also becomes true when we correlate alcohol and the volatile acidity. Our next chart shows the boxplot for the correlation between the volatile acidity and different buckets of alcohol percentage, accordingly to each quality level.
In this section I intend to build a linear regression model that could allow us to estimate the wine quality from the two most important attributes we have found in our EDA.
##
## Calls:
## m1: lm(formula = I(quality) ~ I(alcohol), data = wines)
## m2: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity, data = wines)
##
## ==========================================
## m1 m2
## ------------------------------------------
## (Intercept) 1,875*** 3,095***
## (0,175) (0,184)
## I(alcohol) 0,361*** 0,314***
## (0,017) (0,016)
## volatile.acidity -1,384***
## (0,095)
## ------------------------------------------
## R-squared 0,227 0,317
## adj. R-squared 0,226 0,316
## sigma 0,710 0,668
## F 468,267 370,379
## p 0,000 0,000
## Log-likelihood -1721,057 -1621,814
## Deviance 805,870 711,796
## AIC 3448,114 3251,628
## BIC 3464,245 3273,136
## N 1599 1599
## ==========================================
The resultant model was not as good as I expected to, with a final R-square measure of 32%. Since the objective of this project was limited on EDA, I will not try to engineer any feature in order to reach better results, but instead I will discuss some aspects that may explain the model insuccess:
Data is unbalanced: there are few samples of the higher and lower quality levels. Although it’s not a determinant factor in linear regression, in this case it may have impacted.
Weak correlation: the correlation among the attributes were weak, ie. none of them was greater than 50%.
Dependent attributes: for a better modelling, it would be better to target exactly what we were looking for: good wines, in the case. In this analysis we have considered the whole spectrum for quality, with the exception of some analyzes in which we sub-sampled the set for quality levels greater than or equal to 7, but whose findings were not relevant.
Subjectivity: this is the most interesting aspect to me. Unlike a diamond exploratory data analysis, for example, whose measures are limited to physicochemical factors, the main attribute of our dataset (quality) is totally subjective. In [3], for example, the influence of the human subjectivity on wine tasting is criticized. In addition, in our dataset it was not informed how many distinct professionals evaluated the same wine, nor the divergences they may have had, as well as other important aspects that should be considered.
This chart was suggested after my first project review and shows up the main result of this EDA. The alcohol percentage and the volatile acidity are the dataset attributes most related to the quality of red wines. In this chart, the alcohol percentage was plot through buckets allowing to perceive that low quality wines don’t have more than 12% of alcohol, while high quality wines don’t have less than 9% of alcohol. On the other hand, the volatile acidity is the one responsible for the vinegar unpleasant taste and one of the main characteristics of a bad wine. From this chart this negative correlation among volatile acidity and quality is also clear visible. For last, I have added the title and the units for each variable, turning this chart ready to be published.
In my opinion, this pairplot summarises the main aspects of the dataset I have worked with. From the univariate analysis I was able to identify the most relevant one and then to get an overview of the data by them. The diagonal presents the univariate distribution and upper it we have the correlation values among the attributes. It is the indispensable starting point for any EDA.
I confess the radarchart was the only visualization I found harder to build with R than with python. Even so, I spent a long time trying to find the appropriate package and finding out how to fill its parameters. The result, however, was exactly what I expected: a simple chart but one with the most evident relationships I found out in this dataset.
Regarding the purpose of this analysis, some reflections and discussions were included closer to each chart during the EDA process, in order to facilitate their comprehension. A final word on this, though, is that the EDA allowed not only to understand the dataset but also to identify its main characteristics, and two variables that are somehow correlated to the target variable, in this case the quality of the red wines.
Some questions aroused from this EDA that may be explored on future researches:
What is the winery of each evaluated wine? The reason to include such information is to correlate the wine density with the water density of the watersheds from the production region.
Using this same dataset, it would be interesting to improve the linear model by changing the approach to select the training data. From EDA we know the attributes we have chosen presen more correlation on the bordering values. In this case, it seems reasonably to model our linear regression taking into account this ranges, by defining some operation thresholds for the model, for example.
Also, it will be interesting to apply machine learning on this dataset, since the EDA is already done.
Regarding the technical aspects of EDA, I’m pretty sure R is now one more amazing tool I will put on my data scientist toolkit. Before I got enrolled in this course, and before putting my hands on dirty with this project, I thought someone should choose among python or R. Now I see both languages are plenty of resources and both can be used complementarily.
[1] P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
Available at http://dx.doi.org/10.1016/j.dss.2009.05.016
[2] http://scitechconnect.elsevier.com/wp-content/uploads/2014/06/main-38.pdf
[3] https://www.theguardian.com/lifeandstyle/2013/jun/23/wine-tasting-junk-science-analysis